118 research outputs found

    On the Theory of Spatial and Temporal Locality

    This paper studies the theory of caching and of temporal and spatial locality. We show the following results: (1) hashing can be used to guarantee that caches with limited associativity behave as well as a fully associative cache; (2) temporal locality cannot be characterized by one or a few parameters; (3) temporal locality and spatial locality cannot be studied separately; and (4) unlike temporal locality, spatial locality cannot be managed efficiently online.
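    The hashed-indexing idea in result (1) can be sketched in a few lines: pick the cache set by hashing the block address instead of taking its low-order bits, so strided access patterns that would collide under modulo indexing get spread across all sets. This is an illustrative sketch, not the paper's construction; all names are invented.

```python
# Sketch: a set-associative cache whose set index is a hash of the block
# address rather than the usual low-order bits (modulo indexing).
# Hashing spreads pathological strided patterns across all sets.
import hashlib

class HashedSetCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # each set is an LRU list

    def _index(self, block):
        # Hash the block address to pick a set (vs. block % num_sets).
        h = hashlib.sha256(block.to_bytes(8, "little")).digest()
        return int.from_bytes(h[:4], "little") % self.num_sets

    def access(self, block):
        """Return True on hit, False on miss; maintains LRU within a set."""
        s = self.sets[self._index(block)]
        if block in s:
            s.remove(block)
            s.append(block)      # move to MRU position
            return True
        if len(s) >= self.ways:  # evict the LRU block
            s.pop(0)
        s.append(block)
        return False

# A stride equal to num_sets maps every block to the same set under
# modulo indexing; hashed indexing spreads the blocks out instead.
cache = HashedSetCache(num_sets=16, ways=2)
trace = [i * 16 for i in range(8)] * 2   # strided pass, then a repeat
hits = sum(cache.access(b) for b in trace)
```

    Under plain modulo indexing this trace would thrash a single set and hit rarely; under hashed indexing the eight blocks land in (mostly) distinct sets, so the second pass hits far more often.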

    Comparisons between linear functions can help

    An example is provided of a sorting-type decision problem that can be solved in fewer steps by using comparisons between linear functions of the inputs, rather than comparisons between the inputs themselves. This disproves a conjecture of Yao [14] and Yap [16]. Several extensions are presented.

    MiniAMR - A miniapp for Adaptive Mesh Refinement

    This report describes the detailed implementation of MiniAMR, a miniapp for octree-based adaptive mesh refinement (AMR) that can be used to study the communication costs in a typical AMR simulation. We have designed new data structures and refinement/coarsening algorithms for octree-based AMR and evaluated the resulting performance improvements against similar software from Sandia National Laboratories. We also introduce the idea of amortized load balancing for AMR, and provide a low-overhead distributed load-balancing scheme for AMR applications that perform sub-cycling (refinement in time).
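    The octree bookkeeping underlying this kind of AMR can be sketched minimally: each leaf block can be refined into eight children or coarsened back when all eight children are themselves leaves. This is an illustrative sketch, not MiniAMR's actual data structures.

```python
# Minimal octree-AMR sketch: refine a leaf into 8 children, coarsen a
# node back to a leaf, and enumerate the current leaf blocks.

class OctNode:
    def __init__(self, level=0):
        self.level = level
        self.children = None  # None => this node is a leaf block

    def refine(self):
        assert self.children is None, "only leaves can be refined"
        self.children = [OctNode(self.level + 1) for _ in range(8)]

    def coarsen(self):
        # Legal only when all 8 children are leaves (no hanging refinement).
        assert self.children and all(c.children is None for c in self.children)
        self.children = None

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

root = OctNode()
root.refine()                 # 1 block -> 8 blocks
root.children[0].refine()     # refine one child: 8 - 1 + 8 = 15 leaves
assert len(root.leaves()) == 15
root.children[0].coarsen()    # back to 8 leaves
```

    In a distributed setting, each refine/coarsen changes which blocks exist and therefore where data must live, which is why load balancing (and its amortization) dominates the communication cost.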

    Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O

    With exascale computing on the horizon, the performance variability of I/O systems represents a key challenge in sustaining high performance. In many HPC applications, I/O is performed concurrently by all processes, which leads to I/O bursts. This causes resource contention and substantial variability of I/O performance, which significantly impacts overall application performance and, most importantly, its predictability over time. In this paper, we propose a new approach to I/O, called Damaris, which leverages dedicated I/O cores on each multicore SMP node, along with shared memory, to efficiently perform asynchronous data processing and I/O in order to hide this variability. We evaluate our approach on three different platforms, including the Kraken Cray XT5 supercomputer (ranked 11th in the Top500), with the CM1 atmospheric model, one of the target HPC applications for the Blue Waters post-petascale supercomputer project. By overlapping I/O with computation and by gathering data into large files while avoiding synchronization between cores, our solution brings several benefits: 1) it fully hides jitter as well as all I/O-related costs, which makes simulation performance predictable; 2) it increases the sustained write throughput by a factor of 15 compared to standard approaches; 3) it allows almost perfect scalability of the simulation up to over 9,000 cores, as opposed to state-of-the-art approaches which fail to scale; 4) it enables a 600% compression ratio without any additional overhead, leading to a major reduction of storage requirements.
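    The core mechanism (compute processes hand data to a dedicated I/O worker through shared memory and continue immediately, hiding write latency) can be sketched as follows. A thread stands in for the dedicated core here, and a queue for the shared-memory buffer; names are illustrative, not Damaris's API.

```python
# Sketch of the dedicated-I/O-core idea: the simulation loop hands each
# timestep's output to an asynchronous worker and never blocks on I/O.
import queue
import threading

written = []

def io_worker(buf):
    # Drain the shared buffer asynchronously; None is the shutdown signal.
    while True:
        item = buf.get()
        if item is None:
            break
        written.append(item)     # stands in for an actual file write

buf = queue.Queue()
t = threading.Thread(target=io_worker, args=(buf,))
t.start()

for step in range(4):            # the "simulation" loop
    data = [step] * 3            # stands in for one timestep's output
    buf.put(data)                # non-blocking hand-off; compute continues

buf.put(None)                    # signal shutdown and flush
t.join()
```

    Because the hand-off is non-blocking, the compute loop's timing no longer depends on file-system latency, which is what removes the jitter.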

    Comparing archival policies for Blue Waters

    This paper introduces two new tape archival policies that can improve tape archive performance in certain regimes, compared to the classical RAIT (Redundant Array of Independent Tapes) policy. The first policy, PARALLEL, still requires as many parallel tape drives as RAIT but pre-computes large data stripes that are written contiguously on tapes to increase write/read performance. The second policy, VERTICAL, writes contiguous data to a single tape, updating error-correcting information on the fly and delaying its archival until enough data has been archived. This second approach reduces the number of tape drives used per user request to one. The performance of the three policies (RAIT, PARALLEL, and VERTICAL) is assessed through extensive simulations, using a hardware configuration and a distribution of I/O requests similar to those expected on the Blue Waters system. These simulations show that VERTICAL is the most suitable policy for small files, whereas PARALLEL should be used for files larger than 1 GB. We also demonstrate that RAIT never outperforms both proposed policies, and that a heterogeneous policy mixing VERTICAL and PARALLEL performs 10 times better than any other single policy.
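    The VERTICAL idea (write each file contiguously to one tape while XOR-accumulating an error-correction stripe on the fly, flushing it later) can be sketched with plain XOR parity, which is the same redundancy principle RAIT stripes across drives. Illustrative only; the paper's actual coding scheme is not reproduced here.

```python
# Sketch: contiguous single-tape writes with a running XOR parity stripe.

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

BLOCK = 4
tapes = [bytearray() for _ in range(3)]  # data tapes
parity = bytearray(BLOCK)                # running parity stripe

files = [b"AAAA", b"BBBB", b"CCCC"]
for tape, data in zip(tapes, files):
    tape.extend(data)                       # contiguous write, one drive
    parity[:] = xor_blocks(parity, data)    # update parity on the fly

# Losing one tape is recoverable from the others plus the parity stripe:
recovered = bytes(parity)
for i, tape in enumerate(tapes):
    if i != 1:                              # reconstruct tape 1
        recovered = xor_blocks(recovered, bytes(tape))
```

    Delaying the parity flush is what lets VERTICAL serve each request with a single drive: redundancy is still maintained, but it is written out only once enough data has accumulated.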

    An Unstructured Parallel Least-Squares Spectral Element Solver for Incompressible Flow Problems

    The parallelization of the least-squares spectral element formulation of the Stokes problem has recently been discussed for incompressible flow problems on structured grids. In the present work, the extension to unstructured grids is discussed. It will be shown that, to obtain an efficient and scalable method, two different kinds of data distribution are required, involving a rather complicated parallel conversion between them. Once the data conversion has been performed, a large symmetric positive definite algebraic system has to be solved iteratively. It is well known that the Conjugate Gradient method is a good choice for such systems. To improve the convergence rate of the Conjugate Gradient process, both Jacobi and Additive Schwarz preconditioners are applied. The Additive Schwarz preconditioner is based on domain decomposition and can be implemented such that a preconditioning step corresponds to a parallel matrix-by-vector product. The new results reveal that the Additive Schwarz preconditioner is very suitable for the p-refinement version of the least-squares spectral element method. To obtain portable programs that may run on distributed-memory multiprocessors, networks of workstations, and shared-memory machines, we use MPI (Message Passing Interface). Numerical simulations have been performed to validate the scalability of the different parts of the proposed method. The experiments entailed simulating several large-scale incompressible flows on a Cray T3E and on an SGI Origin 3800, with the number of processors varying from one to more than one hundred. The results indicate that the present method has very good parallel scaling properties, making it a powerful method for numerical simulations of incompressible flows.
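    The solver stage described here, Conjugate Gradient with a diagonal (Jacobi) preconditioner on a symmetric positive definite system, can be sketched serially; a tiny pure-Python stand-in for the distributed MPI implementation, shown on a 2x2 test system.

```python
# Preconditioned Conjugate Gradient with a Jacobi (diagonal) preconditioner.
# A is a dense SPD matrix given as a list of rows; b is the right-hand side.

def pcg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # residual r = b - A*0
    minv = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner M^-1
    z = [minv[i] * r[i] for i in range(n)]    # preconditioned residual
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [minv[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # small SPD test system
b = [1.0, 2.0]
x = pcg(A, b)                   # exact solution: [1/11, 7/11]
```

    In the parallel setting, the matrix-vector product and the dot products become the communication points, and the Additive Schwarz preconditioner replaces the diagonal scaling with per-subdomain solves.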

    CCL: a portable and tunable collective communication library for scalable parallel computers

    A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for the line of scalable parallel computer products by IBM, has been designed. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library, focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic point-to-point communication model.
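    One algorithm family such a library tunes is the binomial-tree broadcast, which finishes in ceil(log2 p) rounds by doubling the set of processes holding the data each round. The sketch below simulates the message flow only; the function name is illustrative, not CCL's API.

```python
# Simulate the message schedule of a binomial-tree broadcast from rank 0.
import math

def binomial_broadcast(p, root=0):
    """Return the list of per-round (src, dst) messages for p processes."""
    has_data = {root}
    rounds = []
    for k in range(math.ceil(math.log2(p))):
        msgs = []
        for src in sorted(has_data):
            dst = src ^ (1 << k)           # partner at distance 2**k
            if dst < p and dst not in has_data:
                msgs.append((src, dst))
        for _, dst in msgs:                # new holders send in later rounds
            has_data.add(dst)
        rounds.append(msgs)
    assert has_data == set(range(p))       # everyone received the data
    return rounds

rounds = binomial_broadcast(8)
# round 0: 0->1; round 1: 0->2, 1->3; round 2: 0->4, 1->5, 2->6, 3->7
```

    A tunable library picks between such schedules (binomial trees, pipelines, scatter-plus-allgather) based on message size and the machine's point-to-point cost model.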

    Scheduling the I/O of HPC Applications Under Congestion

    A significant percentage of the computing capacity of large-scale platforms is wasted because of interference incurred by multiple applications that access a shared parallel file system concurrently. One solution to handling I/O bursts in large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of Argonne's Mira system shows that burst buffers cannot prevent congestion at all times. Consequently, I/O performance is dramatically degraded, showing in some cases a decrease in I/O throughput of 67%. In this paper, we analyze the effects of interference on application I/O bandwidth and propose several scheduling techniques to mitigate congestion. We show through extensive experiments that our global I/O scheduler is able to reduce the effects of congestion, even on systems where burst buffers are used, and can increase overall system throughput by up to 56%. We also show that it outperforms current Mira I/O schedulers.
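    The benefit of ordering I/O phases instead of letting them contend can be shown with back-of-the-envelope arithmetic. The numbers below are illustrative, not measurements from the paper.

```python
# Toy model: two applications each write V GB through a file system with
# bandwidth B GB/s. Concurrent access splits the bandwidth; a global
# scheduler that serializes the two I/O phases grants each full bandwidth.

B = 100.0          # file-system bandwidth, GB/s
V = 200.0          # volume each application writes, GB

# Concurrent (congested) access: each app sustains B/2 the whole time.
t_shared = V / (B / 2)           # both finish at t = 4.0 s

# Scheduled (exclusive) access: app 1 writes first, app 2 second.
t_first = V / B                  # 2.0 s
t_second = t_first + V / B       # 4.0 s

# The makespan is the same, but mean completion time improves, which is
# one way serializing I/O phases raises effective system throughput.
mean_shared = t_shared                     # 4.0 s for both apps
mean_scheduled = (t_first + t_second) / 2  # 3.0 s
```

    Real systems add interference overheads beyond simple bandwidth splitting, so the measured gains (such as the 56% throughput increase reported above) can exceed what this idealized model predicts.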
